Abstract

This is a study of a random walk on the nonnegative integers
whose steps are controlled as follows. Upon arriving at a location
i, a pair of probabilities (p,q) is selected from a prescribed set,
a reward r(i,p,q) is received, and the next step takes the walk to
locations i+1, i-1 or i, with respective probabilities p, q and 1-p-q
(when i=0 these probabilities are p, 0 and 1-p). This is repeated
indefinitely. A rule for successively selecting the probabilities
(p,q) is a control policy. We identify conditions on the rewards and
probabilities under which there exist monotonic optimal policies for
discounted and average rewards. For example, in one case it is optimal
to increase the probability of backward steps as the location i
increases. Our results are based on (1) a criterion for monotone optimal
policies, (2) a result describing when an upper envelope of concave
functions is concave, and (3) a relation between optimal policies for
the discounted and average reward criteria. Procedures for computing
optimal policies are also presented.


Optimal  Control  of  Random  Walks 
by 

Richard F. Serfozo, Syracuse University

1. Introduction

We shall study a controlled random walk on the nonnegative integers
that moves as follows. Upon arriving at a location i the following
events occur:

(1) A pair of probabilities $(p_a, q_a)$ is selected from the set
$\{(p_1,q_1), \ldots, (p_m,q_m)\}$. Think of the $(p_a,q_a)$, or the
$a \in \{1,2,\ldots,m\}$, as the action taken. We assume that
$0 \le p_a + q_a \le 1$, and at least one of these is nonzero.

(2) A real-valued reward r(i,a) is received.

(3) The next location of the walk is determined by the transition
probabilities

     $p(i,a,i+1) = p_a$, $p(i,a,i-1) = q_a$, $p(i,a,i) = 1 - p_a - q_a$

when $i \ge 1$; and

     $p(0,a,1) = p_a$ and $p(0,a,0) = 1 - p_a$ when $i = 0$.

That is, the step is of size +1, -1 or 0 with respective probabilities
$p_a$, $q_a$ and $1 - p_a - q_a$ (except at location 0). The above series
of events is repeated indefinitely.
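The dynamics in (1) - (3) can be sketched in a few lines of Python. This is only an illustration: the names `step` and `simulate`, the dictionary `probs` mapping an action a to its pair (p_a, q_a), and the seeded generator are assumptions made for the example and do not appear in the paper.

```python
import random

def step(i, p, q, rng):
    """Take one step from location i, assuming 0 <= p + q <= 1."""
    u = rng.random()                 # uniform on [0, 1)
    if u < p:
        return i + 1                 # forward step, probability p
    if i >= 1 and u < p + q:
        return i - 1                 # backward step, probability q (never from 0)
    return i                         # otherwise the walk stays where it is

def simulate(start, probs, policy, n_steps, seed=0):
    """Simulate the walk under a stationary policy i -> action.

    probs is a hypothetical dict {action: (p_a, q_a)}.
    """
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        i = path[-1]
        p, q = probs[policy(i)]
        path.append(step(i, p, q, rng))
    return path
```

For instance, `simulate(0, {1: (0.5, 0.2), 2: (0.2, 0.5)}, lambda i: 1 if i < 3 else 2, 200)` follows a policy that switches to the action with the larger backward probability once the walk reaches location 3.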

A policy f for controlling this random walk (i.e. a rule for selecting
the $(p_a, q_a)$) is defined to be a mapping from the nonnegative integers
(the state space) to $\{1, 2, \ldots, m\}$ (the action space). Under the policy
f the action f(i) is taken, i.e. $(p_{f(i)}, q_{f(i)})$ is selected, whenever
the walk is in location i. We shall consider only these so-called
stationary deterministic policies. Nothing would be gained by considering
nonstationary or randomized policies.

Each policy f, along with a rule for starting the process, determines
a stochastic process $\{(X_n, a_n): n \ge 0\}$, where $X_n$ is the location
of the walk at time n, and $a_n = f(X_n)$ is the action taken. The expected
discounted reward over an infinite horizon is

     $V_f(i) = E_f\left( \sum_{n=0}^{\infty} \alpha^n r(X_n, a_n) \mid X_0 = i \right)$,

where $0 < \alpha < 1$ is a discount factor. The average reward over an infinite
horizon is

     $\phi_f(i) = \lim_{n \to \infty} n^{-1} E_f\left( \sum_{k=0}^{n-1} r(X_k, a_k) \mid X_0 = i \right)$.

A policy f* is called $\alpha$-discounted optimal if

     $V_{f^*}(i) = \sup_f V_f(i)$ for all i,

and f* is called average optimal if

     $\phi_{f^*}(i) = \sup_f \phi_f(i)$ for all i.

The  aim  is  to  find  such  policies.  We  shall  call  this  decision  process 
a controlled  random  walk.  It  is  a special  case  of  a Markov  decision 
process,  or  a controlled  Markov  chain. 

Decision processes that arise in practice often have inherent
properties that lead to nicely structured optimal policies. For example,
an optimal policy f(i) may be a monotone, unimodal, or convex function
of i. If it is known that there is, say, an increasing optimal policy, then
the search for an optimal policy may be confined to the class of increasing
policies. An optimal policy might then be obtained by a simple ad hoc
procedure, such as a calculus argument. This is especially important
for decision processes with infinite state spaces (like ours) where
optimal policies cannot be obtained by the standard procedures for
processes with finite state spaces. Structured policies are also
generally easier to implement than unstructured ones.

In this paper we show, under some very general conditions on the
rewards r(i,a) and the probabilities $(p_a, q_a)$, that it is (discounted
and average) optimal to "increase" the probability of backward movement
of the process as the location of the walk increases. We present a
similar result where it is optimal to "decrease" this probability.
We show how these results carry over to finite time horizons, and to
walks where the set of possible probabilities for a step depends on
the location where the step is taken. We also present procedures
for calculating some average optimal policies.

Our analysis herein is based on three key results that apply to
more general Markov decision processes. The first result is a criterion
for the existence of a monotone optimal policy (Proposition 4.1).
Related criteria are discussed in [6] and [8]. The second result
describes when the upper envelope of a family of functions, defined on
the integers, is concave (Proposition 4.2). This result enabled us
to find natural conditions on the rewards r(i,a) that lead to monotone
optimal policies. The third result asserts, under some weak conditions,
that if a Markov decision process has a discounted optimal policy with
a given structure, then it also has an average optimal policy with the
same structure (Theorem 5.1). Part of this result is an extension of
[2, Theorem 1].

Applications of controlled random walks arise in contexts where
the descriptive theory of random walks is used. In a related paper [7],
we apply the results herein to obtain optimal policies for controlling
birth and death processes and queues.

2. Monotone Optimal Policies for Random Walks Based on Discounted Rewards

In this section we identify conditions under which there exist
increasing and decreasing discounted optimal policies for the controlled
random walk. (We use the terms increasing and decreasing to mean
nondecreasing and nonincreasing, respectively.) We also discuss the
monotonicity of these discounted optimal policies with respect to the
discount factor.

Here, and throughout this paper, we shall use the notation introduced
above. We shall use a prime to denote the difference operator with
respect to i, namely u'(i) = u(i+1) - u(i). In particular, we write
r'(i,a) = r(i+1,a) - r(i,a).

Our first result concerns increasing policies. A typical increasing
policy f can be written as

     $f(i) = a$ if $i_a \le i < i_{a+1}$,

where $0 = i_1 \le i_2 \le \ldots \le i_m \le i_{m+1} = \infty$. This means that
if the walk is in location i, and $i_a \le i < i_{a+1}$, then action a is taken,
i.e. $(p_a, q_a)$ is selected. Note that the action increases as i increases.
Also, if $i_a = i_{a+1}$ for a particular action a, then this action is never
taken.
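As an illustration of this threshold representation, the following Python sketch builds an increasing policy from its switching points. The function name `threshold_policy` and the list encoding of the thresholds (the first entry is the first threshold, and the final threshold at infinity is left implicit) are assumptions made for the example.

```python
from bisect import bisect_right

def threshold_policy(thresholds):
    """Build f with f(i) = a whenever thresholds[a-1] <= i < thresholds[a].

    thresholds must be nondecreasing and start at 0; the last threshold
    (infinity) is implicit, so the list has one entry per action.
    """
    if thresholds[0] != 0:
        raise ValueError("the first threshold must be 0")
    if any(x > y for x, y in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be nondecreasing")

    def f(i):
        # Number of thresholds <= i, i.e. the largest action whose
        # threshold has been reached.
        return bisect_right(thresholds, i)

    return f
```

With `thresholds = [0, 3, 3, 7]` the second and third thresholds coincide, so action 2 is never taken: the policy uses action 1 on locations 0 to 2, action 3 on locations 3 to 6, and action 4 from location 7 onward.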

Theorem 2.1. Suppose the following conditions hold.

(1) $p_1 \ge p_2 \ge \ldots \ge p_m$, $q_1 \le q_2 \le \ldots \le q_m$, and $p_1 + q_m \le 1$.

(2) $r'(i,1) \le r'(i,2) \le \ldots \le r'(i,m) \le 0$ for all i.

(3) $r'(i,1) \ge r'(i+1,m)$ for all i.

Then there is an increasing $\alpha$-discounted optimal policy for the random walk.
We shall prove this after we make a few observations. Theorem 2.1
asserts that there is an $\alpha$-discounted optimal policy which selects higher
actions in $\{1, \ldots, m\}$ as the location i of the walk increases. Under
this policy, because of assumption (1), the selected q is an increasing
function of i, and the selected p is a decreasing function of i. Their
ratio p/q is also decreasing in i, since $p_1/q_1 \ge \ldots \ge p_m/q_m$. This
means that the tendency of backward movement of the walk increases as
its location increases. The ratio p/q is like the traffic intensity
of a queueing process. We tried to prove Theorem 2.1 with (1) replaced
by the weaker condition $p_1/q_1 \ge \ldots \ge p_m/q_m$, but we were unsuccessful.
We feel that (1) cannot be relaxed this way, but we do not have a
counterexample to justify this conjecture.

Note that assumption (1) poses no restriction on the (p,q)'s in
the following important examples.

A Random Walk with a Controlled Ascent.

The $p_a$'s are subscripted so that $p_1 \ge \ldots \ge p_m$ and $q_1 = \ldots = q_m$.

A Random Walk with a Controlled Descent.

The $q_a$'s are subscripted so that $q_1 \le \ldots \le q_m$ and $p_1 = \ldots = p_m$.

These  examples  are  analogous  to  an  M/M/1  queue  with  a controlled  arrival 
rate  and  a controlled  service  rate,  respectively.  In  [7]  we  show  that 
these  controlled  queues  are  actually  equivalent  to  the  above  random 
walks,  and  we  apply  the  results  herein  to  obtain  optimal  policies  for 
them. 

The assumptions (1) - (3) ensure that the value function of the
walk (see (5)) is concave. This is a key ingredient for an increasing
policy (see the verification of (8), (9) and (13), and Proposition 4.2).
Note that (2) and (3) hold if and only if

     $0 \ge r'(0,m) \ge r'(0,m-1) \ge \ldots \ge r'(0,1) \ge r'(1,m) \ge \ldots \ge r'(1,1) \ge r'(2,m) \ge \ldots$

This is a very weak restriction on the rewards. It is satisfied, for
example, when

     $r(i,a) = g(a) - h(i)$,

where h is convex increasing and g has any structure. Another consequence
of (2) is that the rewards are bounded from above. Namely,

(4)  $\sup_{i,a} r(i,a) \le \max_a r(0,a) < \infty$.
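Conditions (1) - (3) of Theorem 2.1 are easy to test numerically for a given model. The sketch below is a hedged illustration, not a procedure from the paper: the function name, the 0-based lists `p` and `q` (index 0 holds action 1), and the restriction to finitely many locations are all assumptions, so a `True` result only certifies the conditions on the range that was checked.

```python
def satisfies_theorem_2_1(p, q, r, n_states, tol=1e-12):
    """Check conditions (1)-(3) of Theorem 2.1 for locations 0..n_states-1.

    p, q : lists with p[a-1] = p_a and q[a-1] = q_a for actions a = 1..m
    r    : reward function r(i, a) with a in 1..m
    """
    m = len(p)

    def dr(i, a):
        return r(i + 1, a) - r(i, a)          # r'(i, a)

    cond1 = (all(p[a] >= p[a + 1] - tol for a in range(m - 1))
             and all(q[a] <= q[a + 1] + tol for a in range(m - 1))
             and p[0] + q[-1] <= 1 + tol)
    cond2 = all(dr(i, a) <= dr(i, a + 1) + tol
                for i in range(n_states) for a in range(1, m))
    cond2 = cond2 and all(dr(i, m) <= tol for i in range(n_states))
    cond3 = all(dr(i, 1) >= dr(i + 1, m) - tol for i in range(n_states))
    return cond1 and cond2 and cond3
```

For rewards of the form r(i,a) = g(a) - h(i) with h convex increasing, such as h(i) = i*i, the check succeeds whatever g is, in line with the remark above.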

We shall use the following notation and results in the proof of
Theorem 2.1. We let

(5)  $V(i) = \sup_f V_f(i) = \sup_f E_f\left( \sum_{k=0}^{\infty} \alpha^k r(X_k, a_k) \mid X_0 = i \right)$,

     $V_n(i) = \sup_f E_f\left( \sum_{k=0}^{n-1} \alpha^k r(X_k, a_k) \mid X_0 = i \right)$ for $n \ge 1$,

and $V_0(i) \equiv 0$. These are the infinite and finite horizon value functions
of the random walk. Since the rewards r(i,a) are bounded from above,
it follows that the $V_n$ are finite-valued and

     $V_f(i) \le V(i) < \infty$ for all i and f.

From the theory of Markov decision processes (or dynamic programming)
with upper bounded rewards, we know that the following statements hold.
These come from the basic work of Bellman, Blackwell, Derman, Howard,
Strauch and others, which are nicely unified and extended in [4] and
[5].

(i) (Existence of Stationary Optimal Policies) An $\alpha$-discounted optimal
policy exists.

(ii) (Optimality Criterion) A policy f is $\alpha$-discounted optimal if and
only if

     $U(i,f(i)) = \max_a U(i,a)$ for all i,

where

     $U(i,a) = r(i,a) + \alpha \sum_j p(i,a,j) V(j)$.

(iii) (Optimality Equations) The $V_n$ and V satisfy the optimality equations

     $V_n(i) = \max_a \{ r(i,a) + \alpha \sum_j p(i,a,j) V_{n-1}(j) \}$ $(n \ge 1)$, and

     $V(i) = \max_a \{ r(i,a) + \alpha \sum_j p(i,a,j) V(j) \}$ for all i.

(iv) (Value Iteration) For all i, $V(i) = \lim_{n \to \infty} V_n(i)$.
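Statements (iii) and (iv) suggest a simple numerical scheme: iterate the optimality equations starting from the zero function. The Python sketch below does this on a truncated state space. The truncation is an assumption of the example and not part of the paper: at the top state a forward step is replaced by staying put, so values near the top of the range are only approximations. The function names and the 0-based lists `p`, `q` are likewise hypothetical.

```python
def u_value(i, a, V, p, q, r, alpha):
    """U(i,a) = r(i,a) + alpha * sum_j p(i,a,j) V(j) on states 0..len(V)-1."""
    pa, qa = p[a - 1], q[a - 1]
    top = len(V) - 1
    up = V[min(i + 1, top)]          # truncation: no forward step past the top state
    if i == 0:
        ev = (1 - pa) * V[0] + pa * up
    else:
        ev = qa * V[i - 1] + (1 - pa - qa) * V[i] + pa * up
    return r(i, a) + alpha * ev

def value_iteration(p, q, r, alpha, n_states, n_iter=500):
    """Return the n_iter-period value function, starting from the zero function."""
    m = len(p)
    V = [0.0] * n_states
    for _ in range(n_iter):
        V = [max(u_value(i, a, V, p, q, r, alpha) for a in range(1, m + 1))
             for i in range(n_states)]
    return V

def greedy_policy(V, p, q, r, alpha, tie_tol=1e-9):
    """Largest maximizing action at each state (ties broken upward)."""
    m = len(p)
    policy = []
    for i in range(len(V)):
        vals = [u_value(i, a, V, p, q, r, alpha) for a in range(1, m + 1)]
        best = max(vals)
        policy.append(max(a for a, v in enumerate(vals, start=1)
                          if v >= best - tie_tol))
    return policy
```

A quick sanity check is a reward that is the same constant c in every state and action: every policy then earns c/(1 - alpha), and the iterates converge to that value regardless of the transition probabilities.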


Our last preliminary for the proof of Theorem 2.1 is the following.

Lemma 2.2. If (1) - (3) hold, then $V_n(i)$ is concave decreasing in i
for each $n \ge 0$.

Proof. We shall prove this by induction. Trivially, $V_0 \equiv 0$ is concave
decreasing. Assume that $V_n$ is concave decreasing. The Optimality
Equations (iii) can be written

     $V_{n+1}(i) = \max_a U_n(i,a)$,

where

(6)  $U_n(i,a) = r(i,a) + \alpha \sum_j p(i,a,j) V_n(j)$.

To prove that $V_{n+1}$ is decreasing, it suffices, since $V_{n+1}$ is the upper
envelope of the functions $U_n(\cdot,1), \ldots, U_n(\cdot,m)$, to show

(7)  $U_n'(i,a) \le 0$ for all a and i.

And to prove that $V_{n+1}$ is concave, it suffices, by Proposition 4.2 (in
Section 4), to show

(8)  $U_n'(i,1) \le U_n'(i,2) \le \ldots \le U_n'(i,m)$ for all i, and

(9)  $U_n'(i,1) \ge U_n'(i+1,m)$ for all i.

Writing (6) in terms of the $p_a$ and $q_a$ we get

     $U_n(i,a) = r(0,a) + \alpha[(1-p_a)V_n(0) + p_a V_n(1)]$ for $i = 0$,

     $U_n(i,a) = r(i,a) + \alpha[q_a V_n(i-1) + (1-p_a-q_a)V_n(i) + p_a V_n(i+1)]$ for $i \ge 1$.

Then for any $i \ge 1$,

(10)  $U_n'(i,a) = r'(i,a) + \alpha[q_a V_n'(i-1) + (1-p_a-q_a)V_n'(i) + p_a V_n'(i+1)]$

               $= r'(i,a) + \alpha[V_n'(i) - q_a V_n''(i-1) + p_a V_n''(i)]$.

Under our induction hypothesis, the $V_n'(i)$ and $V_n''(i) = V_n'(i+1) - V_n'(i)$
are nonpositive, since $V_n$ is concave decreasing. Then from the first and
second lines in (10), and the assumptions (1) and (2), it follows that (7)
and (8) are satisfied for $i \ge 1$. The inequality (9) is also satisfied
for $i \ge 1$, since by (1) and (3),

(11)  $U_n'(i+1,m) - U_n'(i,1) = r'(i+1,m) - r'(i,1) + \alpha[q_1 V_n''(i-1) + (1-p_1-q_m)V_n''(i) + p_m V_n''(i+1)] \le 0$.

By similar arguments it follows that (7) - (9) are also satisfied for
i = 0. We have thus proved that $V_{n+1}$ is concave decreasing, and this
completes our induction argument.

We  are  now  ready  to  prove  Theorem  2.1  which  asserts  that  (1)  - (3) 
imply  the  existence  of  an  increasing  a-discounted  optimal  policy. 

Proof of Theorem 2.1. Consider the policy

(12)  $f(i) = \max\{a: U(i,a) = \max_b U(i,b)\}$,

where

     $U(i,a) = r(i,a) + \alpha \sum_j p(i,a,j) V(j)$.

By the Optimality Criterion (ii), this f is $\alpha$-discounted optimal. To
complete the proof, we need only show that f is increasing. To do
this it suffices, by Proposition 4.1, to show

(13)  $U'(i,1) \le U'(i,2) \le \ldots \le U'(i,m)$ for all i.

To this end, note that

     $U(i,a) = r(0,a) + \alpha[(1-p_a)V(0) + p_a V(1)]$ for $i = 0$,

     $U(i,a) = r(i,a) + \alpha[q_a V(i-1) + (1-p_a-q_a)V(i) + p_a V(i+1)]$ for $i \ge 1$.

Then

(14)  $U'(i,a) = r'(0,a) + \alpha[V'(0) - q_a V'(0) + p_a V''(0)]$ for $i = 0$,

      $U'(i,a) = r'(i,a) + \alpha[V'(i) - q_a V''(i-1) + p_a V''(i)]$ for $i \ge 1$.

By Lemma 2.2 and the Value Iteration Property (iv), it follows that V
is concave decreasing. Then using (1), (2), $V'(0) \le 0$, and $V''(i) \le 0$
in (14), we obtain (13). This completes the proof.

We have just shown when it is optimal to increase the probability
of backward movement of the random walk as its location increases.
This tends to keep the walk near zero. Our next result describes the
opposite situation in which it is optimal to decrease the probability
of backward movement as the location increases. This tends to push
the walk toward $+\infty$, accelerating its forward movement as it approaches
$+\infty$. Similar results appear in [6].


Theorem 2.3. Suppose the following hold.

(15) $p_1 \ge p_2 \ge \ldots \ge p_m$ and $q_1 \le q_2 \le \ldots \le q_m$.

(16) $r'(i,1) \ge r'(i,2) \ge \ldots \ge r'(i,m) \ge 0$ for all i.

(17) r(i,a) is convex increasing in i for each a.

(18) $\max_a |r(i,a)| \le g(i)$, where g is a polynomial function in i.

Then there is a decreasing $\alpha$-discounted optimal policy for the random walk.

Note that this result does not require, as Theorem 2.1 does, that
$p_1 + q_m \le 1$. The assumptions (16) - (18) are satisfied if
$r(i,a) = g_1(a) + g_2(i)$, where $g_2(i)$ is a convex increasing polynomial in i.


Proof. A sufficient condition for the above dynamic programming statements
(i) - (iv) to hold, and the $V_n$ and V to exist, is that

(19)  $\lim_{n \to \infty} \sup_f E_f\left( \sum_{k=n}^{\infty} \alpha^k |r(X_k, a_k)| \mid X_0 = i \right) = 0$.

See [5]. If the g in (18) is of the form $g(i) = i^c$, then using the
fact that $X_k \le X_0 + k$, we have

     $E_f\left( \sum_{k=n}^{\infty} \alpha^k |r(X_k, a_k)| \mid X_0 = 0 \right) \le \sum_{k=n}^{\infty} \alpha^k k^c < \infty$.

Similar bounds for this expected value can be obtained for any
polynomial g and any value of $X_0$. These bounds are sufficient for (19) to
hold.

Proceeding as in the proof of Theorem 2.1, we consider the policy

     $f(i) = \max\{a: U(i,a) = \max_b U(i,b)\}$.

This is $\alpha$-discounted optimal by the Optimality Criterion. By an induction
argument, as in the proof of Lemma 2.2, it follows that each n-period
value function $V_n(i)$ is convex increasing in i. Here $V_{n+1}$ is convex
increasing since it is the upper envelope of $U_n(\cdot,1), \ldots, U_n(\cdot,m)$,
which are clearly convex increasing. Then $V(i) = \lim_{n \to \infty} V_n(i)$ is
convex increasing. Finally, arguing as in the proof of Theorem 2.1, it
follows that f is decreasing.

Our final result in this section concerns the monotonicity of
$\alpha$-discounted optimal policies with respect to the discount factor $\alpha$.
This is of interest by itself. It is also a key result for obtaining
average optimal policies from discount optimal policies, which we do
in Section 6.

We shall assume here that we are dealing with a Markov decision
process with transition probabilities p(i,a,j), and rewards r(i,a),
which are bounded from above. We let

(20)  $f_\alpha(i) = \max\{a: U_\alpha(i,a) = \max_b U_\alpha(i,b)\}$,

where

     $U_\alpha(i,a) = r(i,a) + \alpha \sum_j p(i,a,j) V(j)$.

According to the Optimality Criterion, the $f_\alpha$ is an $\alpha$-discounted optimal
policy.

Theorem 2.4. If $f_\alpha(i)$ is increasing in i for each $\alpha$, and

     $r(i,1) \ge r(i,2) \ge \ldots \ge r(i,m)$ for some i,

then $f_\alpha(i) \le f_\beta(i)$ for this i and all $0 < \alpha < \beta < 1$.

Proof. Let $b = f_\alpha(i)$. For any $a < b$ it follows by the definition of
$f_\alpha$ and the hypothesis that

     $0 \le U_\alpha(i,b) - U_\alpha(i,a) = r(i,b) - r(i,a) + \alpha \sum_j [p(i,b,j) - p(i,a,j)] V(j)$

       $\le \alpha \sum_j [p(i,b,j) - p(i,a,j)] V(j)$.

Using this inequality we have

     $U_\beta(i,b) - U_\beta(i,a) \ge U_\alpha(i,b) - U_\alpha(i,a) \ge 0$ for $a < b$.

From this, and the assumption that $f_\beta(i)$ is increasing, we get
$f_\beta(i) \ge b = f_\alpha(i)$ for $\alpha < \beta$. This completes the proof.

R U 

Example 2.5. Consider the controlled random walk with rewards
r(i,a) = g(a) - h(i), where $h(\cdot)$ is convex and increasing, and $g(\cdot)$
is decreasing. By Theorem 2.1 there is an increasing $\alpha$-discounted
optimal policy $f_\alpha$, as defined by (20). Then by Theorem 2.4 we have
$f_\alpha(i) \le f_\beta(i)$ for all $\alpha \le \beta$ and i.

U R 

3. Monotone Discount Optimal Policies for Random Walks with State
Dependent Transitions

We have been discussing a random walk in which each step size is
determined by a pair of probabilities selected from the set
$\{(p_1,q_1), \ldots, (p_m,q_m)\}$, where this set is independent of the
location of the walk. We now consider the case where this set of
probabilities is dependent on the location of the walk. We present
analogs of Theorems 2.1 and 2.3.

We shall assume (only in this section) that the random walk
moves as follows. Upon arriving at location i, the following events
occur:

(1) A pair of probabilities (p(i,a), q(i,a)) is selected from the set
{(p(i,1), q(i,1)), ..., (p(i,m), q(i,m))}.

(2) A reward r(i,a) is received.

(3) The next location of the walk is determined by the transition
probabilities

     p(i,a,i+1) = p(i,a), p(i,a,i-1) = q(i,a), p(i,a,i) = 1 - p(i,a) - q(i,a)

when $i \ge 1$, and

     p(0,a,1) = p(0,a) and p(0,a,0) = 1 - p(0,a) when i = 0.

The above series of events is repeated indefinitely.

In the following, we let

     d(i,a) = q(i,a) - p(i,a).

Theorem 3.1. Suppose the following conditions hold.

(4) $p(i,1) \ge \ldots \ge p(i,m)$, $q(i,1) \le \ldots \le q(i,m)$ and $p(i,1) + q(i,m) \le 1$
for all i.

(5) $d'(i,1) \le \ldots \le d'(i,m) \le 0$ and $d'(i,1) \ge d'(i+1,m)$ for all i.

(6) $r'(i,1) \le \ldots \le r'(i,m) \le 0$ for all i.

(7) $r'(i,1) \ge r'(i+1,m)$ for all i.

Then there is an increasing $\alpha$-discounted optimal policy for the random
walk.

This is similar to Theorem 2.1, except for the additional condition
(5). It can be proved just as we proved Theorem 2.1. The key steps
are to observe the following analogs of (10) and (11) in Section 2:

     $U_n'(i,a) = r'(i,a) + \alpha\{(1 - d'(i,a))V_n'(i) - q(i,a)V_n''(i-1) + p(i+1,a)V_n''(i)\}$,

and

     $U_n'(i+1,m) - U_n'(i,1) = r'(i+1,m) - r'(i,1) + \alpha\{q(i,1)V_n''(i-1)$
          $+ [1 - p(i+1,1) - q(i+1,m) - d'(i+1,m)]V_n''(i)$
          $+ [d'(i,1) - d'(i+1,m)]V_n'(i) + p(i+2,m)V_n''(i+1)\} \le 0$.

The analog of Theorem 2.3 is as follows.

Theorem 3.2. Suppose the following conditions hold.

(8) $p(i,1) \ge \ldots \ge p(i,m)$ and $q(i,1) \le \ldots \le q(i,m)$ for all i.

(9) $d'(i,1) \le \ldots \le d'(i,m)$ for all i.

(10) d(i,a) is concave decreasing in i for each a.

(11) $r'(i,1) \ge \ldots \ge r'(i,m)$ for all i.

(12) r(i,a) is convex in i for each a.

(13) $\max_a |r(i,a)| \le g(i)$, where g is a polynomial in i.

Then there is a decreasing $\alpha$-discounted optimal policy for the random
walk.

4.  Criteria  for  Monotone  Optimal  Policies  and  Concave  Value  Functions 
In  this  section  we  present  two  key  results  which  we  used  above 
for  establishing  the  existence  of  monotone  optimal  policies  for  our 
random  walk. 

We shall consider the general optimization problem

     $v(i) = \max_a u(i,a)$ for i = 0, 1, ...

where $a \in \{1, \ldots, m\}$ and u is a real-valued function. An optimal
policy for this problem is defined to be any mapping f from {0,1,...} to
{1,2,...,m} which satisfies

     $u(i,f(i)) = \max_a u(i,a)$ for all i.

Note that this is an abstraction of the Optimality Criterion in dynamic
programming (recall statement (ii) in Section 2).

Our first result describes sufficient conditions for the existence
of monotone policies. Variations of this, along with other applications,
are discussed in [6] and [8].

Proposition 4.1. Let f be the optimal policy defined by

     $f(i) = \max\{a: u(i,a) = \max_b u(i,b)\}$.

The optimal policy f is increasing if

(1)  $u'(i,1) \le u'(i,2) \le \ldots \le u'(i,m)$ for all i.

The optimal policy f is decreasing if

(2)  $u'(i,1) \ge u'(i,2) \ge \ldots \ge u'(i,m)$ for all i.

Proof. Suppose (1) holds, and there is an i such that f(i+1) < f(i).
By the definition of f(i) and (1), we have

     $0 \le u(i,f(i)) - u(i,f(i+1)) \le u(i+1,f(i)) - u(i+1,f(i+1))$,

and so $u(i+1,f(i+1)) \le u(i+1,f(i))$. But this contradicts the definition
of f(i+1). Thus f must be increasing. The assertion that (2) implies
that f is decreasing is proved similarly.
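The criterion for an increasing policy can be tried out on a small example. In the Python sketch below, the function names and the test function u(i,a) = a*i - a*a are assumptions chosen for illustration. The differences of this u in i equal a, which is increasing in a, so the largest-maximizer policy should come out nondecreasing in i.

```python
def largest_maximizer_policy(u, n_states, m):
    """f(i) = max{a : u(i,a) = max_b u(i,b)} for i = 0..n_states-1.

    Exact equality is used to detect ties, so u should return exact
    values (for example integers).
    """
    policy = []
    for i in range(n_states):
        best = max(u(i, a) for a in range(1, m + 1))
        policy.append(max(a for a in range(1, m + 1) if u(i, a) == best))
    return policy

def has_increasing_differences(u, n_states, m):
    """Check u'(i,1) <= ... <= u'(i,m) for i = 0..n_states-2."""
    return all(u(i + 1, a) - u(i, a) <= u(i + 1, a + 1) - u(i, a + 1)
               for i in range(n_states - 1) for a in range(1, m))

def example_u(i, a):
    return a * i - a * a        # hypothetical objective with u'(i,a) = a
```

Here `largest_maximizer_policy(example_u, 9, 4)` switches to higher actions as i grows, and ties (for instance at i = 3, where actions 1 and 2 both give the value 2) are resolved in favor of the larger action, as in the definition of f.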

In order to apply Proposition 4.1 when u(i,a) is a function of v
(as we did in Section 2), some knowledge of the structure of the value
function v may be required. Since v is the upper envelope of
$u(\cdot,1), \ldots, u(\cdot,m)$, then v is obviously convex, increasing or
decreasing when all of the $u(\cdot,a)$'s are convex, increasing or decreasing,
respectively. The next result describes conditions under which v is concave.

Proposition 4.2. The function v is concave if either of the following
conditions holds.

(3)  $u'(i,1) \le u'(i,2) \le \ldots \le u'(i,m)$ and $u'(i,1) \ge u'(i+1,m)$ for all i.

(4)  $u'(i,1) \ge u'(i,2) \ge \ldots \ge u'(i,m)$ and $u'(i,m) \ge u'(i+1,1)$ for all i.

Proof. Suppose (3) holds and let f be the optimal policy in Proposition
4.1. Using (3) and the increasing property of f we have

     $v'(i) = u(i+1,f(i+1)) - u(i,f(i)) \ge u(i+1,f(i)) - u(i,f(i))$

          $\ge u'(i,1) \ge u'(i+1,m) \ge u(i+2,f(i+2)) - u(i+1,f(i+2)) \ge v'(i+1)$.

Thus v is concave. A similar argument shows that v is concave if (4) holds.
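An upper envelope of concave functions need not be concave, which is why a condition such as (3) is needed. The Python sketch below illustrates this with two small families; the helper names and both families are assumptions made for the example. The first family satisfies (3) and has a concave envelope; the second consists of two linear functions that violate the linking inequality in (3), and its envelope is V-shaped.

```python
def upper_envelope(u, n_states, m):
    """v(i) = max_a u(i,a) for i = 0..n_states-1."""
    return [max(u(i, a) for a in range(1, m + 1)) for i in range(n_states)]

def is_concave(v):
    """True if the successive differences of v are nonincreasing."""
    return all(v[i + 2] - v[i + 1] <= v[i + 1] - v[i] for i in range(len(v) - 2))

def satisfies_condition_3(u, n_states, m):
    """Check condition (3) on the locations where all differences are available."""
    def du(i, a):
        return u(i + 1, a) - u(i, a)
    ordered = all(du(i, a) <= du(i, a + 1)
                  for i in range(n_states - 1) for a in range(1, m))
    linked = all(du(i, 1) >= du(i + 1, m) for i in range(n_states - 2))
    return ordered and linked

OFFSETS = {1: 0, 2: -2, 3: -5}           # hypothetical constants

def u_good(i, a):
    # differences are u'(i,a) = a - 3*i, so (3) holds with m = 3
    return OFFSETS[a] + a * i - 3 * i * (i - 1) // 2

def u_bad(i, a):
    # two linear functions: -i and i - 4
    return -i if a == 1 else i - 4
```

With six locations the envelope of `u_good` is 0, 1, -1, -5, -11, -20, whose differences 1, -2, -4, -6, -9 are decreasing. The envelope of `u_bad` on five locations is 0, -1, -2, -1, 0, which is not concave even though each member of the family is linear.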

5. Discounted and Average Reward Optimal Policies of Similar Structure

If a Markov decision process has a discounted optimal policy with a
special structure, then it seems reasonable that there should be an
average optimal policy with the same structure. We shall show that
this is true in a fairly general setting. In the next section we
apply this to our random walk.

We shall consider a Markov decision process with rewards r(i,a),
and transition probabilities p(i,a,j) for $i,j \ge 0$ and a in some set.
We let $\Pi$ denote the set of all policies f under which the $\alpha$-discounted
reward function $V_f(i)$ is finite-valued for all $0 < \alpha < 1$, and the
limit

     $\phi_f(i) = \lim_{n \to \infty} n^{-1} E_f\left( \sum_{k=0}^{n-1} r(X_k, a_k) \mid X_0 = i \right)$

exists for all i, where $-\infty \le \phi_f(i) < \infty$.


Theorem 5.1. Suppose the Markov decision process described above has
upper bounded rewards, and there is a set of policies $F = \{f_1, f_2, \ldots\}$
in $\Pi$ such that $f_n$ is $\alpha_n$-discounted optimal, where $\alpha_n$ is a sequence with
$\alpha_n \to 1$. Then

(1)  $\sup_{f \in \Pi} \phi_f(i) = \sup_{f \in F} \phi_f(i)$ for all i.

If, in addition, F is a finite set, then there is a policy $f^* \in F$ such
that

(2)  $\phi_{f^*}(i) = \sup_{f \in \Pi} \phi_f(i)$ for all i.

The second part of this result is a slight extension of [2, Theorem 1].
The first part is new. The usefulness of Theorem 5.1 is illustrated in
the next result which follows immediately.

Corollary 5.2. If the Markov decision process in Theorem 5.1 has an
increasing $\alpha$-discounted optimal policy for each $\alpha$, the set of such
policies is finite, and $V_f(i) = -\infty$ for all policies $f \notin \Pi$, then there
exists an increasing average optimal policy.

Proof of Theorem 5.1. Suppose for now that the rewards r(i,a) are all
nonpositive. We first note that for any $f \in \Pi$,

(3)  $\phi_f(i) = \lim_{\alpha \to 1} (1-\alpha)V_f(i)$ for all i.

This follows by the well-known Abelian Theorem [3, p.445], when $\phi_f(i)$
is finite. And it follows when $\phi_f(i) = -\infty$, since

(4)  $(1-\alpha)V_f(i) \le (1-\alpha) E_f\left( \sum_{k=0}^{\nu_\alpha} \alpha^k r(X_k, a_k) \mid X_0 = i \right)$

          $\sim \nu_\alpha^{-1} E_f\left( \sum_{k=0}^{\nu_\alpha} \alpha^k r(X_k, a_k) \mid X_0 = i \right) \to \phi_f(i) = -\infty$ as $\alpha \to 1$,

where $\nu_\alpha$ is the integer part of $(1-\alpha)^{-1}$.
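Relation (3) compares a time average with a discounted (Abel) average. The following Python sketch illustrates the two averages on a deterministic nonpositive reward sequence; it is only a numerical illustration of the limit, with hypothetical function names, and it replaces the expectation over the controlled process by a fixed sequence.

```python
def cesaro_average(rewards):
    """n^{-1} * (r_0 + ... + r_{n-1})."""
    return sum(rewards) / len(rewards)

def abel_average(rewards, alpha):
    """(1 - alpha) * sum_k alpha^k r_k, truncated to the supplied rewards."""
    total, weight = 0.0, 1.0
    for r in rewards:
        total += weight * r
        weight *= alpha
    return (1 - alpha) * total

# Hypothetical reward stream 0, -1, 0, -1, ... with long-run average -1/2.
rewards = [-(k % 2) for k in range(20000)]
```

For this stream the time average over 20000 periods is -0.5, and the discounted average with alpha = 0.999 is about -0.49975, so the two agree to three decimal places as alpha approaches 1.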

Using (3) and the assumption that $f_n$ is $\alpha_n$-discounted optimal, we
have

     $\sup_{f \in \Pi} \phi_f(i) = \sup_{f \in \Pi} \lim_{n \to \infty} (1-\alpha_n)V_f(i) \le \lim_{n \to \infty} (1-\alpha_n)V_{f_n}(i)$

          $\le \sup_{f \in F} \lim_{n \to \infty} (1-\alpha_n)V_f(i) = \sup_{f \in F} \phi_f(i)$.

Furthermore, the first term in the above is always greater than or
equal to this last term, and so they are equal. This proves (1).

Now assume that F is finite. Then there is an $f^* \in F$ which is
$\alpha_{n_k}$-discounted optimal for k = 1, 2, ... where $\alpha_{n_k}$ is some subsequence
of $\alpha_n$. Using Theorem 1 in [9, p.181] (for our nonpositive rewards!)
and (3), it follows for any $f \in \Pi$ that

     $\phi_f(i) = \lim_{\alpha \to 1} (1-\alpha)V_f(i) \le \lim_{k \to \infty} (1-\alpha_{n_k})V_{f^*}(i) = \phi_{f^*}(i)$.

This proves (2).

We now prove (1) and (2) for upper bounded rewards. Let c be an
upper bound for the r(i,a)'s, and consider the Markov decision process
with rewards $\hat{r}(i,a) = r(i,a) - c$, transition probabilities p(i,a,j),
and average rewards $\hat{\phi}_f$. This process has the same set of $\alpha$-discounted
optimal policies as the original process, its rewards are nonpositive,
and $\hat{\phi}_f = \phi_f - c$. Thus, by the above,

     $\sup_{f \in \Pi} \phi_f(i) = \sup_{f \in \Pi} (\hat{\phi}_f(i) + c) = \sup_{f \in F} (\hat{\phi}_f(i) + c) = \sup_{f \in F} \phi_f(i)$.

Now suppose F is finite and $f^* \in F$ is as defined in the preceding
paragraph. Then

     $\phi_{f^*}(i) = \hat{\phi}_{f^*}(i) + c = \sup_{f \in \Pi} \hat{\phi}_f(i) + c = \sup_{f \in \Pi} \phi_f(i)$ for all i.

This completes the proof.

6. Monotone Optimal Policies for Random Walks Based on Average Rewards

Theorem 2.1 describes conditions under which there exists an
increasing $\alpha$-discounted optimal policy for the random walk. In this
section, we show that these conditions, with some minor additions,
are also sufficient for the existence of an increasing average optimal
policy. A similar result holds for decreasing average optimal policies
(based on Theorem 2.3), but for the sake of brevity, we shall not
discuss it.

We shall consider the random walk as described in Sections 1 and
2. In our first result, we use the following conditions.

(1) $p_1 \ge \ldots \ge p_m$, $q_1 \le \ldots \le q_m$, and $p_1 + q_m \le 1$.

(2) $p_1 > 0$, $q_1 > 0$ and $p_m/q_m < 1$.

(3) $r'(i,1) \le \ldots \le r'(i,m) \le 0$ for all i, and at least one of the
r'(i,a) is nonzero.

(4) $r'(i,1) \ge r'(i+1,m)$ for all i.

Under these assumptions the average reward

     $\phi_f(i) = \lim_{n \to \infty} n^{-1} E_f\left( \sum_{k=0}^{n-1} r(X_k, a_k) \mid X_0 = i \right)$

exists for any policy f and $-\infty \le \phi_f(i) < \infty$ (see Proposition 6.2).
Moreover, $\phi_f(i)$ is independent of i, so we shall simply denote it by
$\phi_f$. We let I denote the set of increasing $\alpha$-discounted optimal policies
for $0 < \alpha < 1$. Such policies exist under (1) - (4), by Theorem 2.1.

Theorem 6.1. Suppose the random walk satisfies (1) - (4). Then

     $\sup_f \phi_f = \sup_{f \in I} \phi_f$.

If, in addition, there is an N such that

     $r(i,1) \ge r(i,2) \ge \ldots \ge r(i,m)$ for all $i \ge N$,

then there is an increasing policy in I that is average optimal for
the random walk.

The first assertion says that the increasing policies in I yield
the largest average reward, but it doesn't say that one of the policies
in I actually attains the maximum reward. The second assertion does.

The assumptions (1) - (4) are essentially assumptions (1) - (3) in
Theorem 2.1 with a few minor additions. These additions simply eliminate
some degenerate cases. Specifically, we assume that at least one of
the r'(i,a) is nonzero to rule out the case where the rewards do not
depend on i. With this case ruled out, (3) and (4) imply that
$r(i,a) \downarrow -\infty$ as $i \to \infty$ for all a. We assume $q_1 > 0$ for the sake of
brevity. The analysis presented here also carries over to the case when
some of the $q_a$'s are zero, but more details are involved. The $p_1 > 0$, in
conjunction with $q_1 > 0$, just eliminates the case in which
$p_1 = \ldots = p_m = 0$ and each policy determines a walk that is absorbed at
zero. Even though $p_1 > 0$, some of the other $p_a$'s may be zero. The
$p_m/q_m < 1$ eliminates the case in which each policy f determines a walk
whose states are all transient or null recurrent, and whose average reward
is $\phi_f = -\infty$ (see Proposition 6.2). Here any policy is average optimal.

Proof. If $\phi_f = -\infty$ for all $f$, then the assertions are trivially satisfied. Now suppose there is an $f_0$ with $\phi_{f_0} > -\infty$. It follows that its $\alpha$-discounted reward $V_{f_0}(i) > -\infty$ for all $i$ and $\alpha$. Consequently, for each $f \in I$ we have $V_f(i) \ge V_{f_0}(i) > -\infty$ for all $i$. In addition, [...] for each $f \in I$. For if not, then arguing as in (4) in Section 5, we would have $V_f(i) = -\infty$. Let $\Pi$ be the set of policies $f$ for which [...]. Then from Proposition 6.2, Theorem 2.1, and Theorem 5.1, it follows that

$$\sup_f \phi_f = \sup_{f \in \Pi} \phi_f = \sup_{f \in I} \phi_f.$$

We now establish the existence of an increasing average optimal policy. Let $\bar a$ denote the largest action taken by any policy in $I$. Let $f$ be a policy in $I$ that takes the action $\bar a$, and let $\alpha_0$ be a discount factor such that $f$ is $\alpha_0$-discounted optimal. Let $F$ be the set of increasing $\alpha$-discounted optimal policies $f_\alpha$, as constructed in the proof of Theorem 2.1, for all $\alpha_0 \le \alpha < 1$. Then Theorem 2.4 yields

$$f_\alpha(i) = \bar a \quad \text{for all } i \ge \min\{j \ge N : f(j) = \bar a\} \text{ and } \alpha \ge \alpha_0.$$

Consequently, $F$ is a finite set. Thus from Theorem 5.1 it follows that there is an increasing policy in $I$ that is average optimal for the random walk.

In the next result we present an expression for the average reward function $\phi_f$ of our random walk. We used this in the above proof and we shall use it again in the next section.

Proposition 6.2. Suppose the random walk satisfies (1), (2) and $r(i,a) \downarrow -\infty$ as $i \to \infty$ for all $a$. Let $f$ be a policy, and let

$$y_i = \prod_{k=0}^{i-1} p_{f(k)}/q_{f(k+1)},$$

$$N = \begin{cases} \infty & \text{if } y_i > 0 \text{ for all } i \\ \min\{i \ge 0 : y_{i+1} = 0\} & \text{otherwise.} \end{cases}$$

Case 1. The random walk under $f$ is such that $\{0,1,\ldots,N\}$ is a closed class of positive recurrent states and $\{N+1,N+2,\ldots\}$ are transient states if and only if $\sum_i y_i < \infty$. (When $N = \infty$ the latter set is null.) In this case, the limiting distribution of the walk is

$$\pi_f(i) = \begin{cases} y_i\,\pi_f(0) & \text{if } i \ge 1 \\ \big(1 + \sum_{i=1}^{N} y_i\big)^{-1} & \text{if } i = 0, \end{cases} \tag{1}$$

(if $N < \infty$, then $\pi_f(i) = y_i = 0$ for $i > N$), and

$$\phi_f(i) = r(0,f(0))\,\pi_f(0) + \sum_{k=1}^{N} r(k,f(k))\,y_k\,\pi_f(0)$$

for all $i$.

Case 2. The random walk under $f$ has all transient or null recurrent states if and only if $\sum_i y_i = \infty$. In this case $\phi_f(i) = -\infty$ for all $i$.

Proof. The assertions concerning the classification of states follow by standard arguments for Markov chains. For the case in which $\sum_i y_i < \infty$ we have

$$\phi_f(i) = \sum_{k=0}^{\infty} r(k,f(k))\,\pi_f(k) = r(0,f(0))\,\pi_f(0) + \sum_{k=1}^{N} r(k,f(k))\,y_k\,\pi_f(0).$$

See [1, Corollary 6.2.23]. In this reference the assumption that $r(k,f(k))$ is bounded can be relaxed: one only needs that the above sum exists ($\pm\infty$ being possible values). In our context, [...]

It remains to prove the last assertion. To this end, assume that $\sum_i y_i = \infty$. Let $v_i(n)$ denote the number of visits that the random walk, under $f$, makes to state $i$ in $n$ steps. The $r(i,a)$ is decreasing in $i$, and so for each $j \ge 1$,

$$n^{-1}\sum_{k=0}^{n} r(X_k,a_k) = n^{-1}\sum_{i=0}^{j-1} v_i(n)\,r(i,f(i)) + n^{-1}\sum_{i=j}^{\infty} v_i(n)\,r(i,f(i))$$

$$\le r(0,f(0))\,n^{-1}\sum_{i=0}^{j-1} v_i(n) + r(j,f(j))\,n^{-1}\sum_{i=j}^{\infty} v_i(n)$$

$$= r(j,f(j)) + n^{-1}\sum_{i=0}^{j-1} v_i(n)\,[r(0,f(0)) - r(j,f(j))].$$

Since each $i$ is transient or null recurrent, then $n^{-1}v_i(n) \to 0$ a.s. It follows that

$$\lim_{n\to\infty} n^{-1} E_f\Big(\sum_{k=0}^{n} r(X_k,a_k) \,\Big|\, X_0 = i\Big) \le r(j,f(j)).$$

Letting $j \to \infty$ yields $\phi_f(i) = -\infty$ for all $i$.
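The formulas of Proposition 6.2 are easy to evaluate numerically. The following is a minimal Python sketch and is not part of the original analysis: the function name `average_reward`, the dictionary encoding of the $(p_a,q_a)$, and the truncation parameters `max_states` and `tol` are our own assumptions, and the recursion $y_i = y_{i-1}\,p_{f(i-1)}/q_{f(i)}$ reflects our reading of the product that defines $y_i$. It is meaningful only in Case 1, where $\sum_i y_i < \infty$.

```python
def average_reward(f, p, q, r, max_states=10_000, tol=1e-15):
    """Approximate pi_f(0) and phi_f of Proposition 6.2 by truncating the sums.

    f : function i -> action, p and q : dicts action -> probability,
    r : function (i, a) -> reward.  All names are illustrative assumptions.
    """
    y = 1.0                    # y_0 = 1 (empty product)
    mass = 1.0                 # accumulates 1 + sum_{i>=1} y_i
    weighted = r(0, f(0))      # accumulates r(0,f(0)) + sum_{k>=1} r(k,f(k)) y_k
    for i in range(1, max_states):
        y *= p[f(i - 1)] / q[f(i)]     # y_i = y_{i-1} * p_{f(i-1)} / q_{f(i)}
        if y == 0.0:                   # N is finite: states beyond i-1 are never reached
            break
        mass += y
        weighted += r(i, f(i)) * y
        if y < tol:                    # remaining terms are negligible (assumes a decaying tail)
            break
    pi0 = 1.0 / mass
    return pi0, pi0 * weighted


if __name__ == "__main__":
    # One action with p = 0.2, q = 0.4 and reward r(i,a) = -i.
    print(average_reward(lambda i: 1, {1: 0.2}, {1: 0.4}, lambda i, a: -float(i)))
```

For the single action in the usage lines the limiting distribution is geometric with ratio 1/2, so the sketch should return values close to $\pi_f(0) = 1/2$ and $\phi_f = -1$.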

7.  Computation of Optimal Average Reward Policies: The Two Action Case
We shall show how to compute an increasing average optimal policy for a random walk with two control actions $(p_1,q_1)$ and $(p_2,q_2)$. We discuss the multi-action case in the next section.

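In addition to the notation in Sections 1, 2 and 6, we let $\rho_a = p_a/q_a$ and let

$$D_n = -(\rho_1 - \rho_2)(1 - \rho_2)^{-1}\sum_{i=0}^{n-1} r(i,1)\rho_1^i + \rho_1[r(n,1) - r(n,2)]\Big[\rho_1^{n-1}\rho_2(1-\rho_2)^{-1} + \sum_{i=0}^{n-1}\rho_1^i\Big] + (\rho_1 - \rho_2)\Big(\sum_{i=0}^{n-1}\rho_1^i\Big)\sum_{i=n}^{\infty} r(i,2)\rho_2^{i-n}.$$

For each $n$ ($0 \le n \le \infty$), we define the policy

$$f_n(i) = \begin{cases} 1 & \text{if } 0 \le i < n \\ 2 & \text{if } i \ge n. \end{cases}$$

Note that $f_\infty(i) \equiv 1$ and $f_0(i) \equiv 2$. We also let

$$n^* = \begin{cases} \infty & \text{if } D_n > 0 \text{ for all } n \ge 0 \\ \min\{n \ge 0 : D_n \le 0\} & \text{otherwise.} \end{cases}$$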

Theorem 7.1. Suppose the random walk with two actions satisfies the following conditions.

(1)  $0 < q_1 \le q_2$, $p_1 \ge p_2$, $\rho_2 < 1$ and $p_1 + q_2 \le 1$.

(2)  $r'(i,1) \le r'(i,2) \le 0$ for all $i$, and $r'(i,a) \ne 0$ for some $a$ and $i$.

(3)  $r'(i,1) \ge r'(i+1,2)$ for all $i$.

(4)  $\sum_{i=0}^{\infty} r(i,2)\rho_2^i > -\infty$.

Then the policy $f_{n^*}$ is average optimal.
This says that it is average optimal to select $(p_1,q_1)$ when the walk is below $n^*$ and to select $(p_2,q_2)$ otherwise. The $n^*$ can be obtained when $r(i,2)$ is tractable, say a polynomial in $i$, by successively computing $D_1, D_2, \ldots$ until a $D_n$ is reached such that $D_n \le 0$. Then $n^* = n$. The case when $r(i,a) = g(a) - hi$ is discussed below. In the proof of Theorem 7.1 we show that $D_n$ is decreasing in $n$. This can be used in an obvious way to shorten the procedure for obtaining the $n^*$.

Theorem 7.1 is essentially a special case of Theorem 6.1. We added assumption (4) here, because without it, any increasing policy is average optimal. This follows from Theorem 6.1 and the fact that the average reward for each increasing policy is $-\infty$, as seen from (8) and (9) below.
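The search for $n^*$ described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a prescribed procedure: the names `D` and `n_star`, the truncation length `tail_terms` for the infinite sum, and the cap `max_n` are ours, and the search starts at $n = 0$ to match the definition of $n^*$.

```python
def D(n, rho1, rho2, r, tail_terms=2000):
    """D_n of Section 7; the sum over i >= n is truncated after tail_terms terms."""
    G = sum(rho1 ** i for i in range(n))                  # sum_{i<n} rho1^i
    S = sum(r(i, 1) * rho1 ** i for i in range(n))        # sum_{i<n} r(i,1) rho1^i
    T = sum(r(i, 2) * rho2 ** (i - n) for i in range(n, n + tail_terms))
    # rho1 * [rho1^(n-1) rho2 / (1 - rho2) + G], written without a negative power at n = 0
    bracket = rho1 ** n * rho2 / (1 - rho2) + rho1 * G
    return (-(rho1 - rho2) / (1 - rho2) * S
            + (r(n, 1) - r(n, 2)) * bracket
            + (rho1 - rho2) * G * T)


def n_star(rho1, rho2, r, max_n=10_000):
    """Smallest n with D_n <= 0, or None if no such n is found up to max_n."""
    for n in range(max_n + 1):
        if D(n, rho1, rho2, r) <= 0:
            return n
    return None   # the text sets n* = infinity when D_n > 0 for all n


if __name__ == "__main__":
    # Illustrative data: rho_1 = 1, rho_2 = 1/2, r(i,1) = 10 - i, r(i,2) = -i.
    reward = lambda i, a: (10.0 if a == 1 else 0.0) - i
    print([round(D(n, 1.0, 0.5, reward), 6) for n in range(6)], n_star(1.0, 0.5, reward))
```

Since $D_n$ is decreasing in $n$, the linear scan in `n_star` could be replaced by a bisection over $n$; the scan is kept here for clarity.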

Proof of Theorem 7.1. Let $\phi_n$ denote the average reward of the walk under the policy $f_n$. From Theorem 6.1 it follows that

$$\sup_f \phi_f = \sup_n \phi_n.$$

Then in order to prove that the policy $f_{n^*}$ is average optimal it suffices to show

$$\phi_{n^*} = \sup_n \phi_n. \tag{6}$$


To this end, we first note that by Proposition 6.2, the limiting distribution $\pi_n(\cdot)$ of the walk under policy $f_n$ is as follows:

$$\pi_\infty(i) = (1-\rho_1)\rho_1^i \quad \text{for } i \ge 0,$$

and for $n \ge 0$,

$$\pi_n(i) = \begin{cases} \pi_n(0)\rho_1^i & \text{for } 0 \le i < n \\ \pi_n(0)\rho_1^{n-1}\rho_2^{i-n+1} & \text{for } i \ge n, \end{cases}$$

where

$$\pi_n(0) = \Big[\sum_{k=0}^{n-1}\rho_1^k + \rho_1^{n-1}\rho_2(1-\rho_2)^{-1}\Big]^{-1}.$$

(Here $\sum_{k=0}^{-1} = 0$.) Another application of Proposition 6.2, using the above $\pi_n$'s, yields

$$\phi_\infty = (1-\rho_1)\sum_{i=0}^{\infty} r(i,1)\rho_1^i \tag{7}$$

and for $n \ge 0$,

$$\phi_n = \pi_n(0)\Big\{\sum_{i=0}^{n-1} r(i,1)\rho_1^i + \rho_1^{n-1}\sum_{i=n}^{\infty} r(i,2)\rho_2^{i-n+1}\Big\}. \tag{8}$$

Note that

$$\phi_\infty = \lim_{n\to\infty}\phi_n. \tag{9}$$


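We now show that $\phi_n$ has a global maximum. Using (8) we have

$$\phi_{n+1} - \phi_n = \pi_n(0)\pi_{n+1}(0)\Big\{\big(\pi_n(0)^{-1} - \pi_{n+1}(0)^{-1}\big)\sum_{i=0}^{n-1} r(i,1)\rho_1^i + \rho_1^{n}\pi_n(0)^{-1}\big(r(n,1) - r(n,2)\big) + \rho_1^{n-1}\big(\rho_1\pi_n(0)^{-1} - \rho_2\pi_{n+1}(0)^{-1}\big)\sum_{i=n}^{\infty} r(i,2)\rho_2^{i-n}\Big\}. \tag{10}$$

From the above expression for $\pi_n(0)$, we obtain

$$\pi_n(0)^{-1} - \pi_{n+1}(0)^{-1} = \rho_1^{n-1}(\rho_2 - \rho_1)/(1 - \rho_2),$$

and

$$\rho_1\pi_n(0)^{-1} - \rho_2\pi_{n+1}(0)^{-1} = (\rho_1 - \rho_2)\sum_{k=0}^{n-1}\rho_1^k.$$

Using these in (10) yields

$$\phi_{n+1} - \phi_n = \rho_1^{n-1}\pi_n(0)\pi_{n+1}(0)D_n, \tag{11}$$

where $D_n$ is defined above. In light of this factorization, the $\phi_n$ will have a global maximum if $D_n$ is decreasing in $n$. From the definition of $D_n$ and some algebraic manipulations we can write

$$D_{n+1} - D_n = \rho_1\big(\pi_n(0)^{-1} - \pi_{n+1}(0)^{-1}\big)r(n,1) + \rho_1\pi_{n+1}(0)^{-1}[r(n+1,1) - r(n+1,2)] - \rho_1\pi_n(0)^{-1}[r(n,1) - r(n,2)]$$

$$\qquad + (\rho_1 - \rho_2)\Big(\sum_{k=0}^{n-1}\rho_1^k(1-\rho_2) + \rho_1^n\Big)\sum_{i=n+1}^{\infty} r(i,2)\rho_2^{i-n-1} - \big[\rho_1\pi_n(0)^{-1} - \rho_2\pi_{n+1}(0)^{-1}\big]r(n,2)$$

$$= \pi_{n+1}(0)^{-1}\Big\{\rho_1 r'(n,1) - \rho_2 r'(n,2) + (\rho_1 - \rho_2)\Big[(1-\rho_2)\sum_{i=n+1}^{\infty} r(i,2)\rho_2^{i-n-1} - r(n+1,2)\Big]\Big\}. \tag{12}$$

Under our assumption (2), we have $r'(n,1) \le r'(n,2)$. And for $i \ge n+1$,

$$r(i,2) \le r(n+1,2) + (i-n-1)\,r'(n+1,2),$$

so that

$$\sum_{i=n+1}^{\infty} r(i,2)\rho_2^{i-n-1} \le r(n+1,2)(1-\rho_2)^{-1} + \rho_2(1-\rho_2)^{-2}r'(n+1,2).$$

Using these expressions in (12) yields

$$D_{n+1} - D_n \le \pi_{n+1}(0)^{-1}(\rho_1 - \rho_2)\big[r'(n,1) + \rho_2(1-\rho_2)^{-1}r'(n+1,2)\big] \le 0.$$

This says that $D_n$ is decreasing, and so from (11) we know that $\phi_n$ has a global maximum.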

Suppose that $D_n > 0$ for all $n$. Then $\phi_n$ is increasing, and recall that $n^* = \infty$. Thus from (9) we get

$$\phi_{n^*} = \phi_\infty = \sup_n \phi_n.$$

Now suppose $D_n \le 0$ for some $n$. Here the $\phi_n$ increases until $n$ reaches $n^* = \min\{n : D_n \le 0\}$, and then it decreases to $\phi_\infty$, because of (9). Consequently, $\phi_{n^*} = \sup_n \phi_n$. Since these two cases cover all possibilities, it follows that the policy $f_{n^*}$ is average optimal.


We now consider a special case of the preceding result.

Corollary 7.2. Suppose the random walk with two actions satisfies the following conditions.

(13)  $0 < q_1 < q_2$, $p_1 > p_2$, $\rho_2 < 1$ and $p_1 + q_2 \le 1$.

(14)  $r(i,a) = g(a) - hi$, where $g(1) > g(2)$ and $h > 0$.

Then an average optimal policy is to select $(p_1,q_1)$ when the walk is below $n^*$, and select $(p_2,q_2)$ otherwise. The $n^*$ is the smallest nonnegative integer $n$ for which $z_n \ge 0$, where

^ n + cPj^"  + c - P2(g(l)  - g(2))/(h(p^  - P2>  (1  - p2>)  if  Pj^  1 
2 

( n + n(l  + P2)/(l  - P2)  “ 2p2(g(l)  - g(2))/(h(pj^  - P2>)  if  Pj^  = 1, 

and  c ■ (Pj^  - P2)/((l  - Pj^)  (1  - P2))-  Furthermore, 

(16)  n*  <(p2(g(l)  - g(2))/(h(p^  - P2)(l  - P^))  if  P^  1 

I 1/2 

i I2p2(g(l)  - g(2))/(h(p^  - P2))]  if  Pj^  = 1. 


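Proof. By Theorem 7.1 it follows that an average optimal policy is to select $(p_1,q_1)$ when the walk is below $n^*$ and select $(p_2,q_2)$ otherwise. Here

$$n^* = \begin{cases} \infty & \text{if } D_n > 0 \text{ for all } n \\ \min\{n : D_n \le 0\} & \text{otherwise,} \end{cases}$$

where

$$D_n = -(\rho_1 - \rho_2)(1-\rho_2)^{-1}\sum_{i=0}^{n-1}\big(g(1) - hi\big)\rho_1^i + \rho_1\big(g(1) - g(2)\big)\Big[\rho_2\rho_1^{n-1}(1-\rho_2)^{-1} + \sum_{i=0}^{n-1}\rho_1^i\Big] + \sum_{i=0}^{n-1}\rho_1^i\,(\rho_1 - \rho_2)\Big[g(2)(1-\rho_2)^{-1} - h\sum_{i=n}^{\infty} i\rho_2^{i-n}\Big].$$

We shall now show that $n^* = \min\{n : z_n \ge 0\}$. Using the identities

$$\sum_{i=n}^{\infty} i\rho_2^{i-n} = \rho_2\sum_{i=n}^{\infty}(i-n)\rho_2^{i-n-1} + n\sum_{i=n}^{\infty}\rho_2^{i-n} = \rho_2(1-\rho_2)^{-2} + n(1-\rho_2)^{-1},$$

and

$$\sum_{i=0}^{n-1} i\rho_1^i = \begin{cases} \rho_1\big[1 - \rho_1^n - n(1-\rho_1)\rho_1^{n-1}\big]/(1-\rho_1)^2 & \text{if } \rho_1 \ne 1 \\ n(n-1)/2 & \text{if } \rho_1 = 1, \end{cases}$$

in the above expression for $D_n$, yields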


-1  -1  1 -1 
On  - p2(gU)  - g(2))(l-P2)  + h(p^-P2)(l-P2)  [ 


-h(pj^-p2)  (1-Pj^)  ^(1-P2) 

-d/2)h(Pj^-P2)(l-P2)“\ 


if  Pj^  1 


if  Pi 


Note that $z_n$ is strictly increasing and eventually becomes positive. Consequently,

$$n^* = \min\{n : D_n \le 0\} = \min\{n : z_n \ge 0\}.$$

The $n^*$ is bounded as indicated in (16). This follows since $z_n$ is strictly increasing, and clearly $z_n > 0$ when $n$ equals or exceeds the right side of (16).
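For linear rewards the search for $n^*$ needs no infinite sums. The sketch below is an illustration under our own assumptions. The closed form in `D_linear` is our algebra, obtained by substituting (14) and the first identity above into the definition of $D_n$; it is not transcribed from the text. The function `z_rho1_is_one` encodes our reading of the $\rho_1 = 1$ branch of (15), namely $z_n = n^2 + n(1+\rho_2)/(1-\rho_2) - 2\rho_2(g(1)-g(2))/(h(\rho_1-\rho_2))$ with $\rho_1 - \rho_2 = 1 - \rho_2$. All function names are hypothetical.

```python
import math


def D_linear(n, rho1, rho2, dg, h):
    """D_n for r(i,a) = g(a) - h*i, where dg = g(1) - g(2) (our derivation)."""
    G = sum(rho1 ** i for i in range(n))        # sum_{i<n} rho1^i
    M = sum(i * rho1 ** i for i in range(n))    # sum_{i<n} i * rho1^i
    return (rho2 * dg / (1 - rho2)
            + h * (rho1 - rho2) / (1 - rho2) * (M - (n + rho2 / (1 - rho2)) * G))


def z_rho1_is_one(n, rho2, dg, h):
    """Our reading of the rho_1 = 1 branch of (15)."""
    return n * n + n * (1 + rho2) / (1 - rho2) - 2 * rho2 * dg / (h * (1 - rho2))


def first_n(predicate, max_n=10_000):
    """Smallest n in 0..max_n satisfying predicate, or None."""
    return next((n for n in range(max_n + 1) if predicate(n)), None)


if __name__ == "__main__":
    rho2, dg, h = 0.5, 10.0, 1.0
    by_D = first_n(lambda n: D_linear(n, 1.0, rho2, dg, h) <= 0)
    by_z = first_n(lambda n: z_rho1_is_one(n, rho2, dg, h) >= 0)
    bound = math.sqrt(2 * rho2 * dg / (h * (1 - rho2)))   # rho_1 = 1 branch of (16)
    print(by_D, by_z, bound)
```

In this example the two searches should agree, and the common value should not exceed the square-root bound, which is how we read the $\rho_1 = 1$ branch of (16).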
8.  A Linear  Program  for  Computing  Average  Optimal  Policies 

The random walk we have been studying has an infinite state space. Therefore, we cannot compute optimal policies for it by the standard linear programming or policy improvement procedures for finite state processes. When the rewards $r(i,a)$ are nice functions of $i$ (like polynomials), then the average reward $\phi_f$ for a monotone policy $f$ is tractable (recall Proposition 6.2), and optimal policies might be obtainable via policy improvement. It is sometimes feasible to compute monotone optimal policies directly from the function $\phi_f$ for a small number of actions. We actually did this for two actions in the last section. In this section, we discuss another approach for computing monotone average optimal policies. This is similar to the linear programming approach for finite state processes.

We shall consider a random walk (such as in Theorem 6.1) which has an increasing average optimal policy $f$. We assume the following:

Boundedness Assumption. There is an $N$ such that $f(i) = m$ for all $i \ge N$ (i.e. it is average optimal to select $(p_m,q_m)$ when the walk is in locations $i \ge N$).

We will discuss this below. We also assume, for simplicity, that $p_a > 0$ and $q_a > 1/2$ for all $a$. This ensures that each policy determines a positive recurrent random walk.

We let $\pi = \{\pi(i,a) : i \ge 0,\ 1 \le a \le m,\ \pi(i,m) = 1 \text{ for } i \ge N\}$ denote a randomized policy; the $\pi(i,a)$ is the probability of selecting action $a$ when the walk is at location $i$, and action $m$ is selected for all $i \ge N$. Under the policy $\pi$, the Markov chains $\{(X_n,a_n)\}$ and $\{X_n\}$ are positive recurrent. Letting

$$v(i,a) = \lim_{n\to\infty} P_\pi(X_n = i,\ a_n = a \mid X_0 = j),$$

the average reward is

$$\phi_\pi = \sum_{i=0}^{\infty}\sum_{a=1}^{m} v(i,a)\,r(i,a). \tag{1}$$

We can write

$$v(i,a) = v(i)\,\pi(i,a) \quad \text{and} \quad v(i) = \sum_{a=1}^{m} v(i,a), \tag{2}$$

where

$$v(i) = \lim_{n\to\infty} P_\pi(X_n = i \mid X_0 = j).$$

Since $\pi(i,m) = 1$ for $i \ge N$, then by Proposition 6.2 it follows that

$$v(i) = v(N)(p_m/q_m)^{i-N} \quad \text{for } i \ge N. \tag{3}$$

Consequently, expression (1) simplifies to

$$\phi_\pi = \sum_{i=0}^{N-1}\sum_{a=1}^{m} v(i,a)\,r(i,a) + c(N)\sum_{a=1}^{m} v(N,a),$$

where

$$c(N) = \sum_{k=0}^{\infty} r(N+k,m)(p_m/q_m)^k. \tag{4}$$

The problem of maximizing the $\phi_\pi$ over $\pi$ is clearly equivalent to the following linear programming problem:

$$\max_{v(i,a)} \ \sum_{i=0}^{N-1}\sum_{a=1}^{m} v(i,a)\,r(i,a) + c(N)\sum_{a=1}^{m} v(N,a)$$

subject to

$$\sum_{a=1}^{m} v(0,a) = \sum_{a=1}^{m}\big[v(0,a)(1-p_a) + v(1,a)q_a\big],$$

$$\sum_{a=1}^{m} v(j,a) = \sum_{a=1}^{m}\big[v(j-1,a)p_a + v(j,a)(1-p_a-q_a) + v(j+1,a)q_a\big] \quad \text{for } 1 \le j \le N-1,$$

$$\sum_{a=1}^{m} v(N,a) = \sum_{a=1}^{m}\big[v(N-1,a)p_a + v(N,a)(1-p_a-q_a) + v(N,a)(p_m/q_m)q_a\big],$$

$$\sum_{i=0}^{N-1}\sum_{a=1}^{m} v(i,a) + (1 - p_m/q_m)^{-1}\sum_{a=1}^{m} v(N,a) = 1,$$

$$0 \le v(0,a), \ldots, v(N,a) \le 1 \quad \text{for } 1 \le a \le m.$$

Note that the constraints imply that $v(i,a)$ is the limiting distribution of $\{(X_n,a_n)\}$. An optimal solution $v(i,a)$ of the linear program determines (using (2)) a monotone average optimal policy

$$\pi(i,a) = v(i,a)\Big(\sum_{a'=1}^{m} v(i,a')\Big)^{-1}, \quad 0 \le i \le N-1,$$

$$\pi(i,m) = 1, \quad i \ge N.$$
This optimal policy will be a nonrandom policy when the $v(i,a)$ is calculated by the simplex algorithm, since a nonrandom optimal policy exists. Note that the reason we could reduce our problem to a finite variable problem is that the limiting distribution $v(i)$ of our random walk satisfies (3).

If the Boundedness Assumption does not hold, then the above procedure is still useful. It may not yield a truly optimal policy, but it will yield a suboptimal policy that maximizes $\phi_\pi$ over all increasing policies $\pi$ which select action $m$ for all $i \ge N$. Such a policy, when $N$ is large, should be close to being optimal.
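The restricted class just described (increasing policies that select action $m$ at every location $i \ge N$) is finite, so for small $N$ and $m$ it can also be searched exhaustively. The Python sketch below does this by brute force with the formula of Proposition 6.2; it is not the linear program itself, and the names `average_reward_cutoff`, `best_increasing_policy` and the truncation length `tail_terms` are our own assumptions.

```python
from itertools import combinations_with_replacement


def average_reward_cutoff(actions, p, q, r, tail_terms=500):
    """phi for the policy using actions[i] at i < N and action m at all i >= N.

    actions : tuple of length N with entries in 1..m; p, q : lists indexed by a-1.
    The geometric tail beyond N is truncated after tail_terms terms.
    """
    m, N = len(p), len(actions)
    f = lambda i: actions[i] if i < N else m
    y, mass, weighted = 1.0, 1.0, r(0, f(0))
    for i in range(1, N + tail_terms):
        y *= p[f(i - 1) - 1] / q[f(i) - 1]    # y_i = y_{i-1} p_{f(i-1)} / q_{f(i)}
        mass += y
        weighted += r(i, f(i)) * y
    return weighted / mass


def best_increasing_policy(N, p, q, r):
    """Best (phi, actions) over all nondecreasing action tuples of length N."""
    m = len(p)
    best = None
    # combinations_with_replacement over a sorted range yields nondecreasing tuples
    for actions in combinations_with_replacement(range(1, m + 1), N):
        phi = average_reward_cutoff(actions, p, q, r)
        if best is None or phi > best[0]:
            best = (phi, actions)
    return best


if __name__ == "__main__":
    # Two actions with p_1 = p_2 = 0.2, q_1 = 0.2, q_2 = 0.4 and r(i,a) = g(a) - i.
    reward = lambda i, a: (10.0 if a == 1 else 0.0) - i
    print(best_increasing_policy(8, [0.2, 0.2], [0.2, 0.4], reward))
```

The number of nondecreasing tuples grows like $\binom{N+m-1}{N}$, so this check is practical only for small problems; its use is to validate a linear programming solution, not to replace it.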

We initially thought that the Boundedness Assumption could be justified as follows. Consider a random walk with two actions

(Pltqi)  * ^Pm-l’'*m-l^  ’ rewards  f(i,l)  = r(i,l) 

and  f(l,2)  “ r(i,m),  where  the  probabilities  and  rewards  on  the  left 
of  the  equalities  are  from  the  random  walk  with  m actions.  Suppose 
that an average optimal policy for this two action problem is to select $(p_1,q_1)$ in locations below $n^*$ and select $(p_2,q_2)$ otherwise. (The $n^*$ could be calculated as in the previous section.) Because of the way we defined the two action walk, it appears that $n^*$ could be used as $N$ in the Boundedness Assumption. We tried very hard to prove this, but we could not.

9.  Monotone Optimal Policies for Finite Time Horizons

The above analysis for random walks over an infinite time horizon can also be done for walks over a finite time horizon. To illustrate this, we shall present a finite time horizon analogue of Theorem 2.1.

We shall consider the random walk as in Sections 1 and 2 for $N$ time periods. Nonstationary rewards and policies are of interest for finite horizons. Accordingly, we let $r_n(i,a)$ denote the reward if the walk is in location $i$ at time $n$ and action $a$ is taken. (In Section 2 this was $\alpha^n r(i,a)$.) A policy is a sequence $f = (f_1, \ldots, f_N)$ of mappings from the state space $\{0, 1, \ldots\}$ to the action space $\{1, \ldots, m\}$ with the interpretation that action $f_n(i)$ is taken, i.e. $(p_{f_n(i)}, q_{f_n(i)})$ is selected, if the process is in state $i$ at time $n$. We let


V (i)  . £ r,^(X^.a^) 

..  ..  k=N-n 


^-n 


i) 


and

$$V_n(i) = \sup_f V_{n,f}(i).$$

A policy $f^*$ is called optimal if

$$V_{n,f^*}(i) = V_n(i) \quad \text{for all } i.$$

The Optimality Criterion for finite time horizons asserts that a policy $f$ is optimal if and only if

$$U_n(i,f_n(i)) = \max_a U_n(i,a),$$

where 

U^(i.a)  “ ^ P(i.a, j)V^_^(j) 

and $V_0$ is the zero function. We shall consider the optimal policy $f$ defined by

$$f_n(i) = \max\{a : U_n(i,a) = \max_{a'} U_n(i,a')\}.$$

Theorem 9.1. Suppose the following conditions hold.

(1)  $p_1 \ge \cdots \ge p_m$, $q_1 \le \cdots \le q_m$ and $p_1 + q_m \le 1$.

(2)  $r'_n(i,1) \le \cdots \le r'_n(i,m)$ for all $i$ and $n$.

(3)  $r'_n(i,1) \ge r'_n(i+1,m)$ for all $i$ and $n$.

Then $f_n(i)$ is increasing in $i$ for each $n$.

Proof.  This  can  be  proved  as  we  proved  Theorem  2.1. 
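The finite horizon policy can be computed by backward induction. The Python sketch below is an illustration under assumptions of our own: the rewards are taken to be the same in every period, the recursion used is $U_n(i,a) = r(i,a) + \sum_j P(i,a,j)V_{n-1}(j)$ with $V_n(i) = \max_a U_n(i,a)$ and $n$ counting the periods remaining, and the name `finite_horizon_policy` is hypothetical. Because the walk moves at most one step per period, the values on $\{0,\ldots,M\}$ with $n$ periods remaining depend only on states up to $M+n$, so no truncation error is introduced.

```python
def finite_horizon_policy(horizon, max_state, p, q, r):
    """Backward induction for the controlled walk (stationary rewards assumed).

    Returns a dict n -> list [f_n(0), ..., f_n(max_state)], where n is the number
    of periods remaining and f_n(i) is the largest maximizing action.
    p, q : lists indexed by a-1; r : function (i, a) -> reward.
    """
    m = len(p)
    top = max_state + horizon            # farthest state reachable from 0..max_state
    V = [0.0] * (top + 1)                # V_0 is the zero function on 0..top
    policies = {}
    for n in range(1, horizon + 1):
        size = top - n + 1               # U_n is computed exactly on states 0..top-n
        new_V, f_n = [], []
        for i in range(size):
            best_val, best_a = None, None
            for a in range(1, m + 1):
                up = p[a - 1]
                down = q[a - 1] if i > 0 else 0.0     # no backward step at location 0
                u = (r(i, a) + up * V[i + 1] + down * V[i - 1 if i > 0 else 0]
                     + (1.0 - up - down) * V[i])
                if best_val is None or u >= best_val:  # ">=" keeps the largest maximizer
                    best_val, best_a = u, a
            new_V.append(best_val)
            f_n.append(best_a)
        V = new_V
        policies[n] = f_n[:max_state + 1]
    return policies


if __name__ == "__main__":
    # Illustrative data: r(i,a) = g(a) - i^2 with g(1) = 10.3, g(2) = 0.
    reward = lambda i, a: (10.3 if a == 1 else 0.0) - i * i
    print(finite_horizon_policy(2, 40, [0.3, 0.2], [0.3, 0.4], reward)[2])
```

The example data satisfy conditions (1) - (3) of Theorem 9.1, so the printed policy should be nondecreasing in the location.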